Abstract

We show that the Hedge algorithm, a method that is widely used in Machine Learning, can 
be interpreted as a particular instance of Dual Averaging schemes, which have recently been 
introduced by Nesterov for regret minimization. Based on this interpretation, we establish 
three alternative methods of the Hedge algorithm: one in the form of the original method, but 
with optimal parameters, one that requires less a priori information, and one that is better 
adapted to the context of the Hedge algorithm. All our modified methods have convergence 
results that are better or at least as good as the performance guarantees of the vanilla method. 
In numerical experiments, our methods significantly outperform the original scheme. 

1 Introduction 

The Hedge algorithm was introduced by Freund and Schapire [FS97] and encompasses many well-known
schemes in Machine Learning. For instance, as Freund and Schapire showed, this method is
related to the now widely used AdaBoost algorithm [FS97]. The Hedge algorithm can be used to
solve the following online allocation problem. We want to invest an amount of money in a portfolio 
consisting of different assets at the stock market. After each time step, we can modify the current 
composition of our portfolio. The Hedge algorithm defines an update strategy for our portfolio, 
such that the average performance that we achieve is not much worse than the average performance 
of the most favorable investment product. The portfolio update rule is based on the current loss 
(or gain) that is associated with every investment product. 

In this paper, we propose an alternative viewpoint on the Hedge algorithm, using methods that
have recently been introduced in Convex Optimization. It is well-known that the Hedge algorithm
can be interpreted as a Mirror-Descent scheme [NY83] with an entropy-type prox-function; see for
instance Chapter 11 in [CBL06]. However, this interpretation has two drawbacks. First,
Mirror-Descent schemes require the definition of a convex and closed objective function. In this
setting, the current loss of the investment products corresponds to a subgradient of this objective
function. In particular, we explicitly rule out the possibility of a dynamic objective function with
this approach. However, modeling the performance of a portfolio with a static objective function,
even when we allow random losses, is at best questionable. As the last financial crisis has shown,
significant sudden changes in the performance of an investment product can occur, and these are
more appropriately modeled with a dynamic objective function. Second, in order to ensure
convergence, Mirror-Descent schemes need to give subgradients more weight the earlier they appear.
However, common sense dictates that recent losses contain more relevant information on the future
development of the stock market than losses that occurred years ago. In this paper, we interpret
the Hedge algorithm as a Dual Averaging scheme [Nes09].

Dual Averaging schemes are the natural extension of Mirror-Descent methods and get rid of both 
deficiencies we pointed out above at the same time. When applied to our context, Dual Averaging 
schemes do not make any assumptions on the construction of the losses. For instance, they can be
chosen in an adversarial way with respect to our current portfolio, they can be randomly generated,
or (which reflects some of the latest events at the stock market more accurately) their construction
rule may dynamically change. Moreover, in Dual Averaging schemes, we can give more weight to
the latest losses, which allows us to react much faster to significant changes in the market behavior.

Based on this alternative interpretation of the Hedge algorithm, we give three modifications of 
the Hedge algorithm, namely the Optimal Hedge algorithm, the Optimal Time-Independent Hedge 
algorithm, and the Optimal Aggressive Hedge algorithm. All these methods have convergence 
results that are better or at least as good as the convergence guarantee for the vanilla Hedge 
algorithm. The Optimal Hedge algorithm has the same form as the original Hedge algorithm, 
except that all method parameters are chosen in an optimal way. The Optimal Time-Independent 
Hedge algorithm requires less a priori information than the Optimal Hedge algorithm. Finally, 
the Optimal Aggressive Hedge algorithm considers losses as more relevant the later they appear. 
Numerical results show that all our alternative methods perform better than the vanilla Hedge 
algorithm. More interestingly, using the Optimal Aggressive Hedge algorithm, we end up with an 
average benefit that is even better than the profit of the single most favorable investment product, 
provided that the losses undergo shocks that reverse the performance of assets. This effect would not
have been possible with a static objective function. 

This paper is organized as follows. In Sections 2 and 3, we review Dual Averaging schemes and
the original Hedge algorithm. We show in Section 4 that the Hedge algorithm is a Dual Averaging
scheme and suggest several alternative methods based on this interpretation. We conclude this
paper with some numerical results in Section 5.

2 Dual Averaging methods 

We give a brief review of Dual Averaging schemes, which were introduced by Nesterov in [Nes09].

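Let $Q \subset \mathbb{R}^n$ be a closed and convex set. We assume that we have at our disposal an
oracle $\mathcal{G}$, which returns a vector $g = \mathcal{G}(x) \in \mathbb{R}^n$ for input $x \in Q$.
We interpret the oracle output $g = \mathcal{G}(x)$ as a loss vector that is associated to $x$. The
corresponding loss is defined as $\langle g, x \rangle$, where $\langle \cdot, \cdot \rangle$ denotes
the standard dot product in $\mathbb{R}^n$. Assume now that we repeat this process. That is, for
$t \in \mathbb{N}$, we choose an element $x_t \in Q$, call the oracle $\mathcal{G}$ with input $x_t$,
observe the loss vector $g_t = \mathcal{G}(x_t) \in \mathbb{R}^n$, and update our choice of the
element $x_{t+1} \in Q$. After $T$ rounds, we obtain a total averaged loss of

\[
\mathcal{L}_T := \frac{1}{\sum_{k=0}^{T-1} \lambda_k} \sum_{t=0}^{T-1} \lambda_t \langle g_t, x_t \rangle,
\]

where the numbers $\lambda_0, \dots, \lambda_{T-1} > 0$ can be seen as a tool to weight the losses
according to their appearance. We can compare $\mathcal{L}_T$ to the averaged loss
$\sum_{t=0}^{T-1} \lambda_t \langle g_t, x \rangle / \sum_{k=0}^{T-1} \lambda_k$, where $x$
corresponds to an element in $Q$ that turns out to be optimal in hindsight. The deviation of these
two quantities is called averaged regret and denoted by $\mathcal{R}_T$:

\[
\mathcal{R}_T := \frac{1}{\sum_{k=0}^{T-1} \lambda_k}
\left( \sum_{t=0}^{T-1} \lambda_t \langle g_t, x_t \rangle
- \min_{x \in Q} \left\{ \sum_{t=0}^{T-1} \lambda_t \langle g_t, x \rangle \right\} \right)
= \frac{1}{\sum_{k=0}^{T-1} \lambda_k}
\max_{x \in Q} \left\{ \sum_{t=0}^{T-1} \lambda_t \langle g_t, x_t - x \rangle \right\}.
\]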

If the oracle $\mathcal{G}$ is associated to a convex optimization problem of the form
$\min_{x \in Q} f(x)$, that is, the oracle returns correspond to subgradients of $f$, the averaged
regret $\mathcal{R}_T$ gives us an upper bound on the optimality gap
$\min_{0 \le t \le T-1} f(x_t) - \min_{x \in Q} f(x)$.

Naturally, the following question arises: is there a strategy to update the elements
$x_0, \dots, x_{T-1}$ such that the averaged regret $\mathcal{R}_T$ is bounded from above by a
quantity that converges to zero when $T$ goes to infinity? Nesterov's Dual Averaging schemes [Nes09]
can be used to define such update strategies.

We equip $\mathbb{R}^n$ with a norm $\|\cdot\|$, not necessarily the norm associated with
$\langle \cdot, \cdot \rangle$, and denote by $\|\cdot\|_*$ the corresponding dual norm. Nesterov's
Dual Averaging methods require a prox-function $d : Q \to \mathbb{R}$, that is, a function that is
continuous and strongly convex with modulus $\sigma > 0$ with respect to $\|\cdot\|$ on $Q$. We set
$x_0 = \arg\min_{x \in Q} d(x)$. Without loss of generality, we may assume that $\sigma = 1$ and that
$d$ vanishes at $x_0$. The algorithm accumulates all the loss vectors in a dual variable $s_{t+1}$,
that is, $s_{t+1} = -\sum_{k=0}^{t} \lambda_k g_k$ for any $t = 0, \dots, T-1$. In order to define
$x_{t+1}$, the dual variable $s_{t+1}$ is then projected back on the set $Q$ using the parametrized
mirror-operator

\[
\pi_{Q,\beta_{t+1}} : \mathbb{R}^n \to Q : s \mapsto
\arg\max_{x \in Q} \left\{ \langle s, x - x_0 \rangle - \beta_{t+1} d(x) \right\},
\]

where $\beta_{t+1} > 0$ is some projection parameter. We assume that $d$ is chosen in such a way that
the above optimization problem is easily solvable. The resulting scheme looks as follows.



Algorithm 1 Dual Averaging methods [Nes09]
1: Set $s_0 = 0$ and $x_0 = \arg\min_{x \in Q} d(x)$.
2: Choose positive weights $\{\lambda_t\}_{t \ge 0}$ and a non-decreasing sequence $\{\beta_t\}_{t \ge 0}$ of positive projection parameters.
3: for $t = 0, \dots, T-1$ do
4:    Call the oracle $\mathcal{G}$ to get a loss vector $g_t = \mathcal{G}(x_t) \in \mathbb{R}^n$.
5:    Set $s_{t+1} = s_t - \lambda_t g_t$.
6:    Compute $x_{t+1} := \pi_{Q,\beta_{t+1}}(s_{t+1})$.
7: end for
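The loop of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `dual_averaging` and `euclidean_mirror` are hypothetical, and the example mirror-operator assumes $Q = \mathbb{R}^n$ with the prox-function $d(x) = \frac{1}{2}\|x\|_2^2$ (so that $x_0 = 0$ and the maximizer is $s/\beta$), a choice that does not appear in the text.

```python
def dual_averaging(oracle, mirror, x0, lambdas, betas):
    """Sketch of Algorithm 1 for T = len(lambdas) rounds.

    oracle(x)       -> loss vector g_t as a list of floats
    mirror(s, beta) -> maximizer of <s, x - x0> - beta * d(x) over Q
    betas           -> [beta_0, ..., beta_T], non-decreasing and positive
    """
    s = [0.0] * len(x0)            # dual variable s_0 = 0
    x = list(x0)                   # x_0 is the prox-center
    iterates, loss_vectors = [], []
    for t, lam in enumerate(lambdas):
        g = oracle(x)              # step 4: query the oracle
        iterates.append(x)
        loss_vectors.append(g)
        s = [si - lam * gi for si, gi in zip(s, g)]   # step 5
        x = mirror(s, betas[t + 1])                   # step 6
    return iterates, loss_vectors


def euclidean_mirror(s, beta):
    # Assumed setting: Q = R^n, d(x) = 0.5 * ||x||_2^2, x0 = 0.
    return [si / beta for si in s]
```

With an oracle returning the gradient of $\frac{1}{2}\|x - c\|_2^2$, unit weights, and unit projection parameters, the iterates reach $c$ after one step and stay there, which gives a quick sanity check of the update order.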



Nesterov proved the following result for this method.

Theorem 2.1 (First part of Theorem 1 in [Nes09]) For any $D > 0$, we have:

\[
\frac{1}{\sum_{k=0}^{T-1} \lambda_k}
\max_{x \in Q} \left\{ \sum_{t=0}^{T-1} \lambda_t \langle g_t, x_t - x \rangle : d(x) \le D \right\}
\le \frac{1}{\sum_{k=0}^{T-1} \lambda_k}
\left( \beta_T D + \frac{1}{2} \sum_{t=0}^{T-1} \frac{\lambda_t^2}{\beta_t} \|g_t\|_*^2 \right).
\tag{1}
\]

Let us assume that the oracle returns are uniformly bounded, that is, there exists a constant $L$
such that $\|g_t\|_* \le L$ for any $t = 0, \dots, T-1$. The above theorem motivates several ways to
choose the weights $\lambda_t$ and the projection parameters $\beta_t$. For instance, we can set
$\beta_t = 1$ for any $t$ and choose constant weights $\lambda_t = \lambda^*$ in such a way that the
right-hand side in (1) is minimized. That is, provided that $T$ is fixed in advance, we set
$\lambda^* = (1/L)\sqrt{2D/T}$, for which the right-hand side in (1) becomes $L\sqrt{2D/T}$.
Moreover, Nesterov observed in [Nes09] that for $\beta_t = 1$ the right-hand side in (1) converges
to zero as long as $\sum_{t=0}^{T-1} \lambda_t$ diverges and $\sum_{t=0}^{T-1} \lambda_t^2$ converges
when $T$ goes to infinity. The latter condition implies that the weights $\lambda_t$ converge to
zero. Selecting the $\beta_t$'s in an appropriate way, we can allow non-decreasing weights
$\lambda_t$ while still ensuring that the right-hand side in (1) converges to zero when $T$ goes to
infinity. For instance, as Nesterov suggested in [Nes09], we can set

\[
\lambda_t = \frac{\sqrt{2D}}{L}, \quad \beta_0 = 1, \quad \text{and} \quad
\beta_{t+1} = \sum_{k=0}^{t} \frac{1}{\beta_k} \quad \forall\, t \ge 0,
\tag{2}
\]

for which the right-hand side in (1) is still in $O\left(L\sqrt{D/T}\right)$. The same asymptotic
bound can be guaranteed for $\lambda_t = (t+1)^2 \sqrt{7D}/L$ and $\beta_t = t^{2.5}$ for each $t \ge 0$.
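To make the two basic parameter choices concrete, the following sketch evaluates the right-hand side of (1) with $\|g_t\|_*$ replaced by its bound $L$. The function names are hypothetical. The constant schedule uses $\lambda^* = (1/L)\sqrt{2D/T}$ and $\beta_t = 1$; the time-independent schedule assumes the reading $\lambda_t = \sqrt{2D}/L$ together with the recurrence $\beta_{t+1} = \sum_{k=0}^{t} 1/\beta_k$.

```python
import math


def bound_rhs(lambdas, betas, D, L):
    """Right-hand side of (1) with ||g_t||_* <= L.

    lambdas = [lambda_0, ..., lambda_{T-1}], betas = [beta_0, ..., beta_T].
    """
    T = len(lambdas)
    quad = sum(lam * lam / betas[t] for t, lam in enumerate(lambdas))
    return (betas[T] * D + 0.5 * L * L * quad) / sum(lambdas)


def constant_schedule(T, D, L):
    # Requires T to be known in advance.
    lam = math.sqrt(2.0 * D / T) / L
    return [lam] * T, [1.0] * (T + 1)


def time_independent_schedule(T, D, L):
    # Neither lambda_t nor beta_t depends on the horizon T.
    lam = math.sqrt(2.0 * D) / L
    betas, acc = [1.0], 0.0
    for _ in range(T):
        acc += 1.0 / betas[-1]     # beta_{t+1} = sum_{k<=t} 1 / beta_k
        betas.append(acc)
    return [lam] * T, betas
```

For the constant schedule the function returns $L\sqrt{2D/T}$ up to rounding. For the time-independent schedule it returns $\beta_T L \sqrt{2D}/T$, which is larger by the factor $\beta_T/\sqrt{T}$ (roughly $\sqrt{2}$ for large $T$); this is the price of not fixing $T$ beforehand.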



3 The Hedge algorithm 

The Hedge algorithm [FS97] is a generic method that encompasses many well-known schemes in
Machine Learning. As examples, Multiplicative Weights Update methods are a variation of the
Hedge algorithm (see [AHK05] for a survey) and AdaBoost can be related to the Hedge algorithm
(see [FS97] for more details).

The problem the Hedge algorithm aims at solving can be described as follows. We assume that
we want to invest a certain amount of money at the stock market. We have at our disposal a basket
of $n$ investment products such as shares, currencies, gold, raw materials, real estate, and so on.
Let us denote by $x_{t,i} \ge 0$ the share of our initial amount of money that we invest in product
$i$ at time $t$, where $i = 1, \dots, n$ and $t \ge 0$. We always invest all of our money, that is,
we assume $\sum_{i=1}^{n} x_{t,i} = 1$ for all $t \ge 0$. At every time step $t \ge 0$, we can
evaluate the loss (or gain) $\ell_{t,i}$ corresponding to the investment product $i$, where we assume
$\ell_{t,i} \in [-\mu, \rho]$ for every $t \ge 0$ and any $i = 1, \dots, n$. Thus, given our
portfolio $x_t$ at time $t$, we suffer a loss of $\langle \ell_t, x_t \rangle$ at this time step. The
Hedge algorithm now defines an update strategy for our portfolio such that the averaged loss
$\sum_{t=0}^{T-1} \langle \ell_t, x_t \rangle / T$ that we face is not much worse than the averaged
total loss $\min_{1 \le i \le n} \sum_{t=0}^{T-1} \ell_{t,i} / T$ of the investment product with the
best performance.
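The two quantities compared in this problem statement can be computed directly from a loss history. The helper below is a hypothetical illustration (the name `averaged_losses` and the list-of-lists data layout are assumptions, not part of the paper).

```python
def averaged_losses(losses, portfolios):
    """Return (our averaged loss, averaged loss of the best single product).

    losses[t][i]     : loss of product i at time t
    portfolios[t][i] : share invested in product i at time t (sums to 1)
    """
    T = len(losses)
    n = len(losses[0])
    ours = sum(
        sum(l * x for l, x in zip(loss_t, x_t))
        for loss_t, x_t in zip(losses, portfolios)
    ) / T
    # Best product in hindsight: smallest cumulative loss over all T steps.
    best = min(sum(loss_t[i] for loss_t in losses) for i in range(n)) / T
    return ours, best
```

For example, with two products, three time steps, and a portfolio that always splits the money evenly, the difference `ours - best` is the (unweighted) averaged regret with respect to the best single product.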

The Hedge scheme evaluates the losses through a decreasing score function
$U : [-\mu, \rho] \to (0, 1]$. For the sake of brevity, we focus in this paper only on score
functions of the form $U(z) = \gamma^{az+b}$, where $\gamma \in (0, 1)$, $a > 0$, and
$b \in \mathbb{R}$ are some parameters whose choices we discuss in detail afterwards.
The Hedge algorithm assigns a weight $w_{t,i}$ to every investment product $1 \le i \le n$ and for
every time step $t \ge 0$. The current weight of investment product $i$ depends on its initial weight
and on its performance in the past. More concretely, it is defined as
$w_{t+1,i} := w_{t,i} U(\ell_{t,i})$. The portfolio $x_{t+1}$ is then given by the normalization of
the weight vector $w_{t+1}$. The full method takes the following form.

Algorithm 2 Hedge algorithm [FS97]
1: Let $\gamma \in (0, 1)$, $a > 0$, $b \in \mathbb{R}$, and $w_0 = (1/n, \dots, 1/n) \in \mathbb{R}^n$.
2: for $t = 0, \dots, T-1$ do
3:    Set $x_t = w_t / \sum_{i=1}^{n} w_{t,i}$.
4:    Observe loss vector $\ell_t$.
5:    Set $w_{t+1,i} = w_{t,i} \gamma^{a \ell_{t,i} + b}$ for any $i = 1, \dots, n$.
6: end for
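A direct transcription of Algorithm 2 into Python might look as follows. The function name `hedge` and the convention of passing the whole loss sequence up front are assumptions made for the sake of a compact example; in an online setting the losses would arrive one at a time.

```python
def hedge(losses, gamma, a, b):
    """Sketch of Algorithm 2 with score function U(z) = gamma ** (a*z + b).

    Returns the portfolios x_0, ..., x_{T-1}.
    """
    n = len(losses[0])
    w = [1.0 / n] * n                          # step 1: uniform initial weights
    portfolios = []
    for loss in losses:
        total = sum(w)
        portfolios.append([wi / total for wi in w])    # step 3: normalize
        # step 5: multiplicative update through the score function
        w = [wi * gamma ** (a * li + b) for wi, li in zip(w, loss)]
    return portfolios
```

Since the weights are multiplied by numbers in $(0, 1]$ at every step, they can underflow over long horizons; a practical variant would renormalize the weights or work with their logarithms.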



Freund and Schapire studied the convergence behavior of Algorithm 2. In their paper, they
considered the situation where $\mu = 0$ and $\rho = 1$. The immediate extension of their reasoning
to our more general setting yields the following result.

Theorem 3.1 (Extension of Theorem 2 in [FS97]) With $a = 1/(\mu + \rho)$ and $b = \mu/(\mu + \rho)$,
the sequence $(x_t)_{t=0}^{T-1}$ generated by Algorithm 2 satisfies

\[
\sum_{t=0}^{T-1} \left( \mu + \langle \ell_t, x_t \rangle \right)
\le \frac{(\mu + \rho) \ln(n)}{1 - \gamma}
+ \frac{\ln(1/\gamma)}{1 - \gamma} \min_{1 \le i \le n} \left( \sum_{t=0}^{T-1} \left( \mu + \ell_{t,i} \right) \right).
\tag{3}
\]

As mentioned in [FS97], the above theorem can be extended to any decreasing score function
$U : [-\mu, \rho] \to \mathbb{R}$ that complies with the condition

\[
\gamma^{\frac{z + \mu}{\mu + \rho}} \le U(z) \le 1 - (1 - \gamma) \frac{z + \mu}{\mu + \rho}
\quad \forall\, z \in [-\mu, \rho].
\tag{4}
\]

Optimizing the right-hand side of (3) with respect to $\gamma$, that is, setting
$\gamma = 1 / \left( \sqrt{2 \ln(n)/T} + 1 \right)$, we obtain the score function

\[
U : [-\mu, \rho] \to \mathbb{R} : z \mapsto \left( \sqrt{2 \ln(n)/T} + 1 \right)^{-\frac{z + \mu}{\mu + \rho}},
\tag{5}
\]

for which one can prove the following statement using Theorem 3.1; see [FS97] for more details on
the derivation of this score function.

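Corollary 3.1 (Consequence of Lemma 4 in [FS97]) With the above score function, we have:

\[
\frac{1}{T} \sum_{t=0}^{T-1} \langle \ell_t, x_t \rangle
\le \min_{1 \le i \le n} \frac{1}{T} \sum_{t=0}^{T-1} \ell_{t,i}
+ (\mu + \rho) \left( \sqrt{\frac{2 \ln(n)}{T}} + \frac{\ln(n)}{T} \right).
\]



4 The Hedge algorithm is a Dual Averaging method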



We show that we can recast the Hedge algorithm in the framework of Dual Averaging schemes and
derive alternative versions of the original method.

We define $Q$ as the $(n-1)$-dimensional standard simplex
$\Delta_n = \{x \in \mathbb{R}^n : x \ge 0, \sum_{i=1}^{n} x_i = 1\}$, so that $Q$ encompasses all
possible portfolios. We equip $\mathbb{R}^n$ with the norm $\|x\|_1 := \sum_{i=1}^{n} |x_i|$, for
which the corresponding dual norm is of the form $\|s\|_\infty := \max_{1 \le i \le n} |s_i|$.
Moreover, we endow Algorithm 1 with the prox-function

\[
d_{\Delta_n} : \Delta_n \to \mathbb{R} : x \mapsto \ln(n) + \sum_{i=1}^{n} x_i \ln(x_i).
\]

It is well-known that this function complies with our assumptions on prox-functions, that is, it
is continuous and strongly convex with modulus 1 with respect to $\|\cdot\|_1$ on $\Delta_n$, it
attains its center $x_0 := \arg\min_{x \in \Delta_n} d_{\Delta_n}(x)$ at $(1/n, \dots, 1/n)$, and it
vanishes at this point; see for instance [Nes09] and the references therein. Moreover, we can
explicitly write the corresponding parametrized mirror-operator:

\[
\pi_{\Delta_n, \beta} : \mathbb{R}^n \to \Delta_n : s \mapsto
\left( \frac{\exp(s_i/\beta)}{\sum_{j=1}^{n} \exp(s_j/\beta)} \right)_{i=1}^{n}.
\]
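For the entropy prox-function on the simplex, the mirror-operator is a softmax of $s/\beta$. A minimal sketch, assuming the closed form $x_i \propto \exp(s_i/\beta)$; subtracting the largest entry before exponentiating is an implementation choice for numerical stability and does not change the result.

```python
import math


def simplex_mirror(s, beta):
    """Entropy mirror-operator on the simplex: x_i proportional to exp(s_i / beta)."""
    m = max(s)                                   # shift to avoid overflow in exp
    e = [math.exp((si - m) / beta) for si in s]
    z = sum(e)
    return [ei / z for ei in e]
```

For $s = 0$ the operator returns the prox-center $(1/n, \dots, 1/n)$, and a larger $\beta$ pulls the output closer to this center for any fixed $s$.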

Given a loss vector $\ell_t \in \mathbb{R}^n$, that is, $\ell_{t,i}$ corresponds to the loss of
investment product $i$ at time $t$, we evaluate this vector through an affine function
$z \mapsto az + b$, where $a > 0$ and $b \in \mathbb{R}$. We interpret the resulting vector
$g_t := (a \ell_{t,i} + b)_{i=1}^{n}$ as the output $\mathcal{G}(x_t)$ of the oracle $\mathcal{G}$ in
Algorithm 1 for input vector $x_t \in \mathbb{R}^n$. Note that we do not specify any construction
rule for the loss vectors $\ell_t$. For instance, they could be chosen randomly or in an adversarial
way with respect to the portfolio $x_t$. Algorithm 1 takes the following form for our setting, where
we express the parametrized mirror-operator $\pi_{\Delta_n, \beta}$ in a form that makes the
comparison of the resulting method with the Hedge algorithm rather transparent.



Algorithm 3 Extended Hedge algorithm
1: Choose positive weights $\{\lambda_t\}_{t \ge 0}$ and a non-decreasing sequence $\{\beta_t\}_{t \ge 0}$ of positive projection parameters.
2: Let $a > 0$, $b \in \mathbb{R}$, and $w_0 = (1/n, \dots, 1/n) \in \mathbb{R}^n$.
3: for $t = 0, \dots, T-1$ do
4:    Set $x_t = w_t / \sum_{i=1}^{n} w_{t,i}$.
5:    Observe loss vector $\ell_t$.
6:    Set $w_{t+1,i} = \exp\left( -\lambda_t (a \ell_{t,i} + b) / \beta_{t+1} \right) w_{t,i}^{\beta_t / \beta_{t+1}}$ for any $i = 1, \dots, n$.
7: end for
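Algorithm 3 can be sketched as below. Storing the logarithms of the weights is an implementation choice made here (an assumption, not part of the paper): it turns the power $w_{t,i}^{\beta_t/\beta_{t+1}}$ into a multiplication and avoids underflow. The function name `extended_hedge` is hypothetical.

```python
import math


def extended_hedge(losses, lambdas, betas, a, b):
    """Sketch of Algorithm 3; betas = [beta_0, ..., beta_T], all positive.

    Returns the portfolios x_0, ..., x_{T-1}.
    """
    n = len(losses[0])
    logw = [math.log(1.0 / n)] * n           # log of the uniform start weights
    portfolios = []
    for t, loss in enumerate(losses):
        m = max(logw)                         # step 4: normalize (stable softmax)
        e = [math.exp(v - m) for v in logw]
        z = sum(e)
        portfolios.append([ei / z for ei in e])
        ratio = betas[t] / betas[t + 1]
        # step 6 in log form:
        # log w_{t+1,i} = (beta_t / beta_{t+1}) * log w_{t,i}
        #                 - lambda_t * (a * l_{t,i} + b) / beta_{t+1}
        logw = [
            ratio * v - lambdas[t] * (a * li + b) / betas[t + 1]
            for v, li in zip(logw, loss)
        ]
    return portfolios
```

With $\beta_t = 1$ and $\lambda_t = \ln(1/\gamma)$ for all $t$, the update reduces to $w_{t+1,i} = w_{t,i} \gamma^{a \ell_{t,i} + b}$, so the sketch produces the same portfolios as the multiplicative rule of Algorithm 2.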



Let us now discuss several strategies for choosing the weights $\lambda_t$, the projection
parameters $\beta_t$, and the affine function $z \mapsto az + b$ in Algorithm 3. However, first we
observe that the norm of each oracle return $a \ell_t + b$ and the prox-function $d_{\Delta_n}$ are
bounded from above by the quantities $L(a, b) := \max\{|-a\mu + b|, |a\rho + b|\}$ and by
$D := \ln(n)$, respectively.

Original Hedge algorithm: If $\beta_t = 1$ and $\lambda_t = \ln(1/\gamma)$ for any
$t = 0, \dots, T-1$ and with a fixed $\gamma \in (0, 1)$, we recover the Hedge algorithm. This
implies that the Hedge algorithm is a Dual Averaging scheme.

Optimal Hedge algorithm: Theorem 2.1 yields for these weights and projection parameters:

\[
\frac{1}{T} \max_{x \in \Delta_n} \sum_{t=0}^{T-1} \langle \ell_t, x_t - x \rangle
\le \frac{1}{a T \ln(1/\gamma)} \left( D + \frac{T}{2} \ln^2(1/\gamma) L^2(a, b) \right).
\]

This bound is minimized by parameters $\gamma^*$, $a^*$, and $b^*$ that satisfy

\[
\gamma^* = \exp\left( -\frac{2}{a^* (\mu + \rho)} \sqrt{\frac{2 \ln(n)}{T}} \right)
\quad \text{and} \quad b^* = \frac{\mu - \rho}{2} a^*,
\]

where $a^* > 0$. We refer to Algorithm 3 with the just specified setting as Optimal Hedge algorithm,
for which we have by the above inequality:

\[
\frac{1}{T} \max_{x \in \Delta_n} \sum_{t=0}^{T-1} \langle \ell_t, x_t - x \rangle
\le (\mu + \rho) \sqrt{\frac{\ln(n)}{2T}}.
\]

This result improves the bound of Corollary 3.1 by the additive quantity $(\mu + \rho) \ln(n)/T$ and
by a multiplicative factor of 2. Note that the resulting score function
$U(z) = (\gamma^*)^{a^* z + b^*}$ does not comply with Condition (4). Therefore, neither Theorem 3.1
nor its extension can be used to establish the above bound.
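The minimization behind the Optimal Hedge parameters can be carried out in two steps. The following worked computation is a sketch under the assumption that, with $c := \ln(1/\gamma)$, $\beta_t = 1$, and $\lambda_t = c$, Theorem 2.1 bounds the averaged regret in the losses $\ell_t$ by $\frac{1}{a}\left(\frac{D}{Tc} + \frac{c}{2} L^2(a,b)\right)$; the symbol $c$ is introduced here only for brevity.

```latex
% Step 1: choose b so that both endpoints of [-\mu, \rho] are mapped to the
% same absolute value, which minimizes L(a,b) = max{|-a\mu + b|, |a\rho + b|}.
\[
\min_{b \in \mathbb{R}} L(a, b) = \frac{a(\mu + \rho)}{2}
\quad \text{at} \quad b^* = \frac{(\mu - \rho)\, a}{2}.
\]
% Step 2: minimize A/c + Bc over c > 0, with A = D/(aT) and B = a(\mu+\rho)^2/8.
% The minimum 2\sqrt{AB} is attained at c = \sqrt{A/B}.
\[
\frac{1}{a} \left( \frac{D}{T c} + \frac{c\, a^2 (\mu + \rho)^2}{8} \right)
\;\ge\; (\mu + \rho) \sqrt{\frac{D}{2T}},
\quad \text{with equality at} \quad
c^* = \frac{2}{a (\mu + \rho)} \sqrt{\frac{2D}{T}}.
\]
```

Substituting $D = \ln(n)$ and $\gamma^* = \exp(-c^*)$ gives the stated parameters. The factor $a$ cancels in the final bound, which is why any $a^* > 0$ can be chosen.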

Optimal Time-Independent Hedge algorithm: The update parameter $\gamma$ depends on the
number of iterations $T$ in both algorithms, the Original Hedge algorithm with the score function
(5) suggested by Freund and Schapire [FS97] and the Optimal Hedge algorithm. However, when
investing our money at the stock market, we might not want to fix in advance the number of times
that we adapt our portfolio. We thus need an update parameter that is independent of $T$. Adapting
Nesterov's strategy (2), we choose $\gamma \in (0, 1)$ and set $\lambda_t := \ln(1/\gamma)$,
$\beta_0 = 1$, and $\beta_{t+1} = \sum_{k=0}^{t} 1/\beta_k$ for any $t \ge 0$. Applying Theorem 2.1,
we obtain for any $T \ge 1$:



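\[
\frac{1}{T} \max_{x \in \Delta_n} \sum_{t=0}^{T-1} \langle \ell_t, x_t - x \rangle
\le \frac{\beta_T}{a T} \left( \frac{D}{\ln(1/\gamma)} + \frac{\ln(1/\gamma) L^2(a, b)}{2} \right).
\]

We minimize the right-hand side of the above inequality, that is, we choose $a^* > 0$ and set

\[
\gamma^* = \exp\left( -\frac{2 \sqrt{2 \ln(n)}}{a^* (\mu + \rho)} \right)
\quad \text{and} \quad b^* = \frac{\mu - \rho}{2} a^*.
\]

Exploiting Lemma 3 in [Nes09], we obtain for the resulting method, which we refer to as the Optimal
Time-Independent Hedge algorithm, the following inequalities:

\[
\frac{1}{T} \max_{x \in \Delta_n} \sum_{t=0}^{T-1} \langle \ell_t, x_t - x \rangle
\le \frac{\beta_T}{T} (\mu + \rho) \sqrt{\frac{\ln(n)}{2}}
\le \left( \frac{1}{1 + \sqrt{3}} + \sqrt{2T - 1} \right) \frac{\mu + \rho}{T} \sqrt{\frac{\ln(n)}{2}}.
\]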
Optimal Aggressive Hedge algorithm: The later a loss appears, the more likely it is that
this loss vector contains relevant information for the future development of the investment
products' performances. We conclude this section by introducing an alternative version of the Hedge
algorithm, where we continuously increase the weights of the loss vectors as time proceeds. For
fixed $\gamma \in (0, 1)$, we set $\lambda_t = \ln(1/\gamma)(t+1)^2$ and $\beta_t = t^{2.5}$ for any
$t \ge 0$. Let $T > 6$. Using the relations
$\sum_{t=0}^{T-1} (t+1)^2 = T(T+1)(2T+1)/6 \ge T^3/3$, $\sum_{t=0}^{T-1} (t+1)^4 \le 2T^5/7$, and
Theorem 2.1, we obtain for Algorithm 3:

\[
\frac{6}{T(T+1)(2T+1)} \max_{x \in \Delta_n} \sum_{t=0}^{T-1} (t+1)^2 \langle \ell_t, x_t - x \rangle
\le \frac{3}{a T^3} \left( \frac{T^{2.5} D}{\ln(1/\gamma)}
+ \frac{\ln(1/\gamma)}{2} \sum_{t=0}^{T-1} \frac{(t+1)^4}{T^{2.5}} L^2(a, b) \right)
\le \frac{3}{a \sqrt{T}} \left( \frac{D}{\ln(1/\gamma)} + \frac{\ln(1/\gamma) L^2(a, b)}{7} \right).
\]

The latter quantity is minimized for $a^*$, $b^*$, and $\gamma^*$ satisfying

\[
\gamma^* = \exp\left( -\frac{2 \sqrt{7 \ln(n)}}{a^* (\mu + \rho)} \right)
\quad \text{and} \quad b^* = \frac{\mu - \rho}{2} a^*
\]

with $a^* > 0$. We call the resulting method Optimal Aggressive Hedge algorithm, for which we can
rewrite the above inequality as:

\[
\frac{6}{T(T+1)(2T+1)} \max_{x \in \Delta_n} \sum_{t=0}^{T-1} (t+1)^2 \langle \ell_t, x_t - x \rangle
\le 3 (\mu + \rho) \sqrt{\frac{\ln(n)}{7T}}.
\]

Note that the averaged regret reflects our time-varying choice of the weights $\lambda_t$.

30 investment products ($\mu = 0.5133$, $\rho = 0.5175$)

Number of iterations                  7800     15600     23400     31200   w.r.t. best product
Best investment product            -0.0045   -0.0034   -0.0081   -0.0110
Original Hedge                      0.0040    0.0039   -0.0020   -0.0047     42.7%
Optimal Hedge                       0.0028    0.0020   -0.0042   -0.0075     68.2%
Optimal Time-Independent Hedge      0.0010    0.0011   -0.0047   -0.0073     66.4%
Optimal Aggressive Hedge            0.0014   -0.0061   -0.0183   -0.0252    229.1%

Table 1: Averaged losses achieved by the best investment product, by the Original Hedge algorithm,
by the Optimal Hedge algorithm, by the Optimal Time-Independent Hedge algorithm, and by the
Optimal Aggressive Hedge algorithm after one, two, three, and four months of trading. In the last
column, we express the final averaged loss as a percentage of the final averaged loss achieved by
the best investment product.



5 Numerical results 

[Figure 1 (plot): "30 investment products ($\mu = 0.5133$, $\rho = 0.5175$)"; horizontal axis:
number of iterations, from 7800 to 31200; vertical axis ticks from -0.03 to 0.01.]

Figure 1: Averaged losses $\sum_{k=1}^{t} \langle \ell_{k-1}, x_{k-1} \rangle / t$,
$t = 1, \dots, T$, achieved by the best investment product (thick line), by the Original Hedge
algorithm (thick dashed line), by the Optimal Hedge algorithm (dotted line), by the Optimal
Time-Independent Hedge algorithm (thin dashed line), and by the Optimal Aggressive Hedge algorithm
(dashed-dotted line).

We select a pool of $n = 30$ investment products and consider $T = 31200$ iterations of the methods
that we presented. The number $T$ is chosen in such a way that it corresponds to the number of
transactions at a stock exchange during four months (20 trading days of 6h30 for one month),
provided that there is a transaction every minute. The losses $\ell_t \in \mathbb{R}^n$,
$t = 0, \dots, T-1$, are randomly generated. The first 7800 losses $(\ell_t)_{t=0}^{7799}$, that is,
the losses observed during the first month, are realizations of a multivariate normally distributed
random vector with mean $\mu_1$ and covariance matrix $\Sigma$. The data $(\mu_1, \Sigma)$ is taken
from [dat09]. The losses $(\ell_{7800(j-1)+k})_{k=0}^{7799}$ observed in month $j$, where
$j = 2, 3, 4$, are realizations of a multivariate normally distributed random vector with the same
covariance matrix $\Sigma$, but with a different mean $\mu_j$. In our experiments, we modify each
component $\mu_{j-1,i}$ of $\mu_{j-1}$ as $\mu_{j,i} = a_{j,i} \mu_{j-1,i} + b_j$, with $b_j$ small.
The coefficient $a_{j,i}$ is negative with an increasing probability as $j$ increases (namely 1/2,
3/4, and 1), reverting the performance of more and more products. The level of perturbation
$|a_{j,i}|$ is also increasing as $j$ increases. The experiments are run 10 times, and the obtained
losses are averaged afterwards.
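The loss-generation procedure described in this section can be approximated by the following sketch. Several simplifications are assumptions and not taken from the paper: the components are drawn independently with a common standard deviation `sigma` instead of using the covariance matrix $\Sigma$ from [dat09], and the perturbation magnitudes in `scales` as well as the offsets in `shifts` are hypothetical values chosen only so that $|a_{j,i}|$ increases with $j$. The function names are hypothetical as well.

```python
import random


def perturbed_means(mu1, flip_probs=(0.5, 0.75, 1.0),
                    scales=(1.0, 1.5, 2.0), shifts=(0.0, 0.0, 0.0), rng=None):
    """Means for months 1..4: mu_{j,i} = a_{j,i} * mu_{j-1,i} + b_j.

    a_{j,i} = -scale with probability p and +scale otherwise (scales are
    hypothetical); b_j is taken from shifts.
    """
    rng = rng or random.Random(0)
    means = [list(mu1)]
    for p, scale, shift in zip(flip_probs, scales, shifts):
        current = []
        for m in means[-1]:
            sign = -1.0 if rng.random() < p else 1.0
            current.append(sign * scale * m + shift)
        means.append(current)
    return means


def generate_losses(means, sigma, per_month=7800, rng=None):
    """Draw per_month loss vectors for each monthly mean.

    Assumption: independent normal components with standard deviation sigma.
    """
    rng = rng or random.Random(1)
    losses = []
    for mu in means:
        for _ in range(per_month):
            losses.append([rng.gauss(m, sigma) for m in mu])
    return losses
```

With the default `flip_probs`, every mean changes its sign between the third and the fourth month, which mimics the regime in which a product that performed best early on accumulates losses later.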

In Figure 1, we show the averaged losses, that is,
$\sum_{k=1}^{t} \langle \ell_{k-1}, x_{k-1} \rangle / t$ for any $t \ge 1$, achieved by the most
successful investment product at instant $t$ (obviously, this winning product might change over
time), by the Original Hedge algorithm (with Freund and Schapire's score function as described in
(5)), by the Optimal Hedge algorithm, by the Optimal Time-Independent Hedge algorithm, and by the
Optimal Aggressive Hedge algorithm. Note that we show for the Optimal Aggressive Hedge algorithm
also the quantity $\sum_{k=1}^{t} \langle \ell_{k-1}, x_{k-1} \rangle / t$, although we use a
different weighting in the algorithm and in its theoretical analysis; compare with the last section.
In Table 1, we give the averaged losses after each month.

We observe that all the extensions of the Hedge algorithm that we suggested in this paper 
significantly outperform its original counterpart. Even more interestingly, the Optimal Aggressive 
Hedge algorithm achieves an averaged loss that is more than two times better than the averaged 
loss of the best investment product after 4 months. The Optimal Aggressive Hedge algorithm 
outperforms the most successful investment product, as the investment product with the best 
performance has accumulated a significant loss in an early month. This happens as we switch signs 
when we perturb the means of the distribution that we use to generate random losses. 

Compared to the other versions of the Hedge algorithm that we suggested in this paper, the
Optimal Aggressive Hedge algorithm reacts faster and thus more successfully to the perturbations.
This is due to the increasing weights $\lambda_t$, which make losses the more relevant the later
they appear. Recall that all the other methods consider the losses as equally important.